100 research outputs found
Large Scale Homophily Analysis in Twitter Using a Twixonomy
In this paper we perform a large-scale homophily analysis on Twitter using a hierarchical representation of users' interests which we call a Twixonomy. In order to build a population, community, or single-user Twixonomy we first associate "topical" friends in users' friendship lists (i.e. friends representing an interest rather than a social relation between peers) with Wikipedia categories. A word-sense disambiguation algorithm is used to select the appropriate Wikipedia page for each topical friend. Starting from the set of Wikipedia pages representing "primitive" interests, we extract all paths connecting these pages with topmost Wikipedia category nodes, and we then prune the resulting graph G efficiently so as to induce a directed acyclic graph. This graph is the Twixonomy. Then, to analyze homophily, we compare different methods to detect communities in a peer-friends Twitter network, and for each community we compute the degree of homophily on the basis of a measure of pairwise semantic similarity. We show that the Twixonomy provides a means for describing users' interests in a compact and readable way and allows for a fine-grained homophily analysis. Furthermore, we show that mid-low level categories in the Twixonomy represent the best balance between informativeness and compactness of the representation.
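The per-community homophily computation described above can be sketched as follows. This is a minimal illustration, using Jaccard overlap between users' interest-category sets as a simple stand-in for the paper's Wikipedia-based pairwise semantic similarity; the function name and toy data are hypothetical.

```python
from itertools import combinations

def community_homophily(interests):
    """Mean pairwise similarity of community members' interest sets.

    `interests` maps a user id to the set of Twixonomy categories
    associated with that user's topical friends. Jaccard overlap is
    used here as a simple stand-in for the paper's semantic similarity.
    """
    pairs = list(combinations(interests.values(), 2))
    if not pairs:
        return 0.0
    sims = [len(a & b) / len(a | b) if a | b else 0.0 for a, b in pairs]
    return sum(sims) / len(sims)

# Toy community: three users sharing most interests -> fairly high homophily.
community = {
    "u1": {"Music", "Jazz", "Guitar"},
    "u2": {"Music", "Jazz", "Piano"},
    "u3": {"Music", "Rock", "Guitar"},
}
print(community_homophily(community))
```

Comparing this score across communities detected by different algorithms is what lets the paper rank community-detection methods by the homophily they expose.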
A Topic Recommender for Journalists
The way in which people acquire information on events and form their own opinion on them has changed dramatically with the advent of social media. For many readers, the news gathered from online sources becomes an opportunity to share points of view and information within micro-blogging platforms such as Twitter, mainly aimed at satisfying their communication needs. Furthermore, the need to explore news stories in greater depth stimulates a demand for additional information, which is often met through online encyclopedias such as Wikipedia. This behaviour has also influenced the way in which journalists write their articles, requiring a careful assessment of what actually interests the readers. The goal of this paper is to present a recommender system, What to Write and Why, capable of suggesting to a journalist, for a given event, the aspects not yet covered in news articles on which readers focus their interest. The basic idea is to characterize an event according to the echo it receives in online news sources and associate it with the corresponding readers' communicative and informative patterns, detected through the analysis of Twitter and Wikipedia, respectively. Our methodology temporally aligns the results of this analysis and recommends the concepts that emerge as topics of interest from Twitter and Wikipedia but are either not covered or poorly covered in the published news articles.
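The final recommendation step described above can be sketched as a simple filter: concepts salient on Twitter or Wikipedia but weakly represented in news coverage are surfaced to the journalist. The scoring below (max of the two reader-side weights, threshold on the news weight) is a hypothetical simplification of the paper's temporally aligned analysis.

```python
def recommend_topics(twitter_concepts, wiki_concepts, news_concepts,
                     min_news_weight=0.1):
    """Concepts attracting reader interest but under-covered in the news.

    Each argument maps a concept to a normalized salience weight in [0, 1].
    Reader interest is taken as the max of the Twitter and Wikipedia
    weights; a concept is recommended when its news weight falls below
    `min_news_weight`. Returned concepts are sorted by decreasing interest.
    """
    interest = {}
    for source in (twitter_concepts, wiki_concepts):
        for concept, weight in source.items():
            interest[concept] = max(interest.get(concept, 0.0), weight)
    return sorted(
        (c for c in interest if news_concepts.get(c, 0.0) < min_news_weight),
        key=lambda c: -interest[c],
    )

# Toy event (invented weights): readers discuss the cause and the history,
# but published articles only cover the evacuation routes.
twitter = {"evacuation routes": 0.9, "fire cause": 0.4}
wiki = {"fire cause": 0.8, "region history": 0.3}
news = {"evacuation routes": 0.7}
print(recommend_topics(twitter, wiki, news))
```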
Modeling Quality and Machine Learning Pipelines through Extended Feature Models
The recently increased complexity of Machine Learning (ML) methods has created a need to streamline both research and industrial development processes. ML pipelines have become an essential tool for experts of many domains, data scientists, and researchers, allowing them to easily put together several ML models to cover the full analytic process starting from raw datasets. Over the years, several solutions have been proposed to automate the building of ML pipelines, most of them focused on semantic aspects and characteristics of the input dataset. However, an approach taking into account the new quality concerns required by ML systems (such as fairness, interpretability, and privacy) is still missing. In this paper, we first identify, from the literature, key quality attributes of ML systems. We then propose a new engineering approach for quality-aware ML pipelines by extending the Feature Models meta-model. The presented approach allows modeling ML pipelines, their quality requirements (on the whole pipeline and on single phases), and the quality characteristics of the algorithms used to implement each pipeline phase. Finally, we demonstrate the expressiveness of our model on a classification problem.
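The idea of attaching quality requirements to individual pipeline phases can be illustrated with a minimal data model. This sketch is an assumption-laden simplification, not the authors' extended Feature Models meta-model; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class QualityRequirement:
    attribute: str    # e.g. "fairness", "interpretability", "privacy"
    threshold: float  # minimum acceptable score in [0, 1]

@dataclass
class PipelinePhase:
    name: str                 # e.g. "preprocessing", "training"
    algorithm: str            # implementation chosen for this phase
    qualities: dict = field(default_factory=dict)   # measured scores
    requirements: list = field(default_factory=list)

    def satisfies(self):
        """True if every per-phase quality requirement is met."""
        return all(self.qualities.get(r.attribute, 0.0) >= r.threshold
                   for r in self.requirements)

# A training phase whose algorithm meets the fairness requirement
# but falls short on interpretability.
phase = PipelinePhase(
    name="training",
    algorithm="random_forest",
    qualities={"fairness": 0.85, "interpretability": 0.4},
    requirements=[QualityRequirement("fairness", 0.8),
                  QualityRequirement("interpretability", 0.6)],
)
print(phase.satisfies())
```

Pipeline-wide requirements would be checked the same way over the union of phases; the point of the model is that algorithm selection becomes a constraint-satisfaction problem over these quality annotations.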
Can Twitter be a source of information on allergy? Correlation of pollen counts with tweets reporting symptoms of allergic rhinoconjunctivitis and names of antihistamine drugs
Pollen forecasts are in use everywhere to inform therapeutic decisions for patients with allergic rhinoconjunctivitis (ARC). We exploited data derived from Twitter in order to identify tweets reporting a combination of symptoms consistent with a case definition of ARC and those reporting the name of an antihistamine drug. In order to increase the sensitivity of the system, we applied an algorithm aimed at automatically identifying jargon expressions related to medical terms. We compared weekly Twitter trends with National Allergy Bureau weekly pollen counts derived from US stations, and found a high correlation of the sum of the total pollen counts from each station with tweets reporting ARC symptoms (Pearson's correlation coefficient: 0.95) and with tweets reporting antihistamine drug names (Pearson's correlation coefficient: 0.93). Longitude and latitude of the pollen stations affected the strength of the correlation. Twitter and other social networks may play a role in allergic disease surveillance and in signaling drug consumption trends.
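The correlation reported above can be reproduced in outline: sum pollen counts over stations per week, count matching tweets per week, and compute Pearson's coefficient on the two aligned series. The weekly series below are invented for illustration, not the paper's data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative weekly series: total pollen counts summed over stations
# vs. tweets matching the ARC symptom case definition.
pollen = [120, 340, 560, 800, 650, 300, 90]
tweets = [40, 95, 160, 230, 190, 100, 35]
print(pearson(pollen, tweets))
```

With genuinely co-varying series like these, the coefficient lands close to 1, matching the magnitude of the 0.95 and 0.93 values the abstract reports.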
Towards a Prediction of Machine Learning Training Time to Support Continuous Learning Systems Development
The problem of predicting the training time of machine learning (ML) models has become extremely relevant in the scientific community. Being able to predict a priori the training time of an ML model would enable the automatic selection of the best model both in terms of energy efficiency and in terms of performance in the context of, for instance, MLOps architectures. In this paper, we present the work we are conducting in this direction. In particular, we present an extensive empirical study of the Full Parameter Time Complexity (FPTC) approach by Zheng et al., which is, to the best of our knowledge, the only approach formalizing the training time of ML models as a function of both the dataset's and the model's parameters. We study the formulations proposed for the Logistic Regression and Random Forest classifiers, and we highlight the main strengths and weaknesses of the approach. Finally, we observe how, from the conducted study, the prediction of training time is strictly related to the context (i.e., the involved dataset) and how the FPTC approach is not generalizable.
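The spirit of FPTC, predicting training time from a closed-form function of dataset and model parameters, can be sketched as follows. The linear-in-(samples, features, epochs) form and the constant-fitting procedure below are a deliberately simplified stand-in, not Zheng et al.'s actual formulation; all run measurements are invented.

```python
def fptc_logreg(n, d, epochs, c):
    """Toy closed-form training-time estimate for logistic regression:
    time grows linearly in samples (n), features (d), and epochs,
    scaled by a machine-dependent constant c fitted on measured runs.
    """
    return c * n * d * epochs

def fit_constant(measured):
    """Least-squares fit of c from (n, d, epochs, seconds) measurements."""
    num = sum(n * d * e * t for n, d, e, t in measured)
    den = sum((n * d * e) ** 2 for n, d, e, t in measured)
    return num / den

# Invented timings from three small runs on one machine and one dataset.
runs = [(1_000, 20, 100, 0.8), (2_000, 20, 100, 1.7), (4_000, 20, 100, 3.3)]
c = fit_constant(runs)
print(fptc_logreg(8_000, 20, 100, c))  # extrapolate to a larger run
```

The abstract's caveat shows up directly in this framing: the fitted constant (and, in the real formulation, the parametric form itself) is tied to the machine and dataset it was calibrated on, which is why the prediction does not transfer to other contexts.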
- …